Text Clustering to Support Knowledge Acquisition from Documents
Author: Stéphane Lapalut
Abstract
In the early stages of the knowledge acquisition process, interviews with experts produce a large amount of rich but ill-structured text. Knowledge engineers need a tool to help them exploit these texts. We propose the use of a statistical method, top-down hierarchical classification, together with a new interpretation of its results. The initial statistical analysis proposed by M. Reinert (Reinert, 1979 and 1992) gives two kinds of results: first, a segmentation of the texts that reflects their «semantic contexts», which we use to bring out the structure of the texts; and second, classes of significant terms belonging to these contexts, which can be related to the experts or to their specialities. In this paper, we describe the method, its empirical validity, its comparison with similar approaches, and its uses, with examples and results. We conclude with some research directions for dealing with so-called "ontologies" of experts' domains.
Keywords: hierarchical top-down classification, statistical text analysis, text segmentation, text structure discovery, semantic context.
French title: Aggregation of Text Segments to Support Knowledge Acquisition from Documents
French abstract: In the first stages of the knowledge acquisition process, a large quantity of text is produced that is rich in expertise but unstructured. The knowledge engineer then needs a tool to exploit all these texts. We propose the use of a statistical method, descending hierarchical classification, and a new interpretation of its results. This statistical analysis, as proposed by Max Reinert (Reinert, 1979 and 1992), gives two kinds of results: first, a segmentation of the texts that reflects their «semantic contexts», which we use to bring out the structure of the texts; and second, a set of classes of terms attached to these contexts, which can be used to characterize the experts or their specialities. In this report, we describe the method, its empirical validity, similar approaches, and its use, with some examples and results. We conclude with research directions for handling «ontologies» of domains of expertise.
French keywords: descending hierarchical classification, statistical text analysis, text segmentation, text structure discovery, semantic context.
Stéphane Lapalut, ACACIA project, INRIA Sophia Antipolis, BP 93, 06 902 Sophia Antipolis cedex, France
Similar resources
Document clustering based on ontology and a fuzzy approach
Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from a large amount of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering, the unsupervised classification of similar documents into different groups, is one of the important techniques of text mining. The most important step...
Metaheuristic clustering of Persian XML documents based on structural and content similarity
Given the growing number of XML documents, organizing them effectively so that useful information can be retrieved from them is essential. One possible solution is to cluster XML documents in order to discover knowledge. A key issue in clustering XML documents is how to measure the similarity between them. Conventional clustering of text documents using a do...
A Latent Semantic Indexing-based approach to multilingual document clustering
The creation and deployment of knowledge repositories for managing, sharing, and reusing tacit knowledge within an organization has emerged as a prevalent approach in current knowledge management practices. A knowledge repository typically contains vast amounts of formal knowledge elements, which generally are available as documents. To facilitate users' navigation of documents within a knowledge...
Grid-enabled Support for Classification and Clustering of Textual Documents
This paper presents the fusion of two approaches – Grid and Grid computing and text mining. GridMiner is a system developed at the University of Vienna, and it is a framework for the knowledge discovery process in the distributed Grid environment. JBOWL is a framework for text mining and information retrieval being developed at the Technical University in Košice. Text mining provides some methods (includin...
A Hybrid Method for Manufacturing Text Mining Based on Document Clustering and Topic Modeling Techniques
As the volume of online manufacturing information grows steadily, the need for developing dedicated computational tools for information organization and mining becomes more pronounced. This paper proposes a novel approach for facilitating search and organization of textual documents and also extraction of thematic patterns in manufacturing corpora using document clustering and topic modeling te...